The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. “feature engineering”) before we build our models.

Import the data and split it into training and testing sets using initial_split():

library(tidyverse)
library(tidymodels)

ames <- read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/ames.csv")

ames <- ames %>%
 select(-matches("Qu"))

set.seed(100)

new_split <- initial_split(ames) 
new_train <- training(new_split) 
new_test <- testing(new_split)

Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.
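  • step_novel() adds a catch-all level to every nominal predictor so that values not seen in the training data can still be handled later. Character columns are converted to factors along the way.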
  • We then convert the factor columns into (one or more) numeric binary (0 and 1) variables for the levels of the training data.
  • We remove any numeric variables that have zero variance.
  • We normalize (center and scale) the numeric variables.
  • Finally, we prep() the recipe(). This estimates everything the steps need (for example factor levels, means and standard deviations) from our training data.

ames_rec <-
  recipe(Sale_Price ~ ., data = new_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>% 
  prep() # put recipe into action

# Show the content of our recipe
ames_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         73
## 
## Training data contained 2198 data points and no missing data.
## 
## Operations:
## 
## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, ... [trained]
## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, ... [trained]
## Zero variance filter removed MS_SubClass_new, ... [trained]
## Centering and scaling for Lot_Frontage, Lot_Area, ... [trained]

Print a summary of our recipe:

summary(ames_rec)

To obtain the preprocessed training data from the recipe as a data frame, we use the function juice():

juice(ames_rec)
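
In current versions of recipes, juice() is marked as superseded in favor of bake() with new_data = NULL. The following minimal sketch, which uses a small hypothetical data frame (toy_train) rather than the Ames data, suggests that both calls return the same processed training data:

```r
library(recipes)

# Hypothetical toy data, used only for illustration (not the Ames data)
toy_train <- data.frame(
  y = c(10, 12, 15, 19),
  x = c(1, 2, 3, 4)
)

toy_rec <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  prep()

# Both calls return the processed training data stored in the prepped recipe
from_juice <- juice(toy_rec)
from_bake  <- bake(toy_rec, new_data = NULL)

all.equal(as.data.frame(from_juice), as.data.frame(from_bake))
```

Either form can be used in the rest of this tutorial; we keep juice() to stay close to the original code.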

We now can simply apply all of the recipe transformations to the testing data. The function to perform this is called bake():

test_trans <- bake(ames_rec, new_data = new_test)
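
To see what bake() does with new data, consider the following sketch on hypothetical toy data (toy_train and toy_test are made up for illustration). The point is that the test data are transformed with the statistics estimated from the training data, not with statistics of their own:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
toy_test  <- data.frame(y = c(5, 6),       x = c(50, 60))

toy_rec <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_predictors()) %>%
  prep()  # estimates mean and sd of x from toy_train only

toy_test_trans <- bake(toy_rec, new_data = toy_test)

# The same transformation done by hand with the training statistics
by_hand <- (toy_test$x - mean(toy_train$x)) / sd(toy_train$x)

all.equal(toy_test_trans$x, by_hand)
```

Reusing the training statistics keeps information from the test set out of the preprocessing.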

Now it’s time to specify and then fit our models.

The recipe ames_rec contains all our transformations for data preprocessing and feature engineering, as well as the data these transformations were estimated from. When we juice() the recipe, we squeeze that training data back out, transformed in the ways we specified.

lm_spec <- 
  linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode(mode = "regression")

lm_fit <- 
  lm_spec %>%
  fit(Sale_Price ~ . , data = juice(ames_rec))

0.1 Evaluate models

set.seed(100)

cv_folds <-
 vfold_cv(juice(ames_rec), 
          v = 10, 
          strata = Sale_Price,
          breaks = 4) 

lm_res <-
  lm_spec %>% 
  fit_resamples(
    Sale_Price ~ .,
    resamples = cv_folds
    )

lm_res %>% 
  collect_metrics()

1 Evaluate final model

Finally, let’s use our testing data and see how we can expect this model to perform on new data.

lm_fit %>% 
 predict(test_trans) %>%
 mutate(truth = test_trans$Sale_Price) %>%
 rmse(truth, .pred)
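
As a reminder of what rmse() measures, here is a small sketch on made-up numbers (toy_pred is hypothetical and unrelated to the Ames results). It compares the yardstick result with the root mean squared error computed by hand:

```r
library(yardstick)

# Hypothetical true values and predictions, used only for illustration
toy_pred <- data.frame(
  truth = c(3, 5, 7, 9),
  .pred = c(2, 5, 8, 11)
)

# RMSE as computed by yardstick
rmse_yardstick <- rmse(toy_pred, truth, .pred)

# RMSE by hand: square root of the mean squared difference
rmse_by_hand <- sqrt(mean((toy_pred$truth - toy_pred$.pred)^2))

all.equal(rmse_yardstick$.estimate, rmse_by_hand)
```

The value is expressed in the units of the outcome, so for Sale_Price it can be read as a typical prediction error in the same units as the sale price.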

2 Detailed discussion

Let’s have a closer look at the different components of the recipe.

2.1 recipe()

Above, we started from a simple recipe containing only an outcome (Sale_Price) and predictors (all other variables in the dataset: .). To demonstrate the use of recipes step by step, we create a new object with the name rec:

rec <- recipe(Sale_Price ~ ., data = ames)

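In the formula Sale_Price ~ ., the left-hand side names the outcome and the dot on the right-hand side stands for all remaining columns, which become the predictors.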

2.2 Helper functions

Here are some helper functions for selecting sets of variables:

  • all_predictors(): Each x variable (right side of ~)
  • all_outcomes(): Each y variable (left side of ~)
  • all_numeric(): Each numeric variable
  • all_nominal(): Each categorical variable (e.g. factor, string)
  • dplyr::select() helpers such as starts_with("Lot_"), etc.
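
The helpers listed above can be combined inside a step. The following sketch uses a small hypothetical data frame (toy) with made-up column names; step_log() is used here only as an example of a step that takes a starts_with() selection:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  price     = c(100, 150, 120, 180),
  lot_area  = c(50, 80, 60, 90),
  lot_front = c(5, 8, 6, 9),
  street    = factor(c("paved", "gravel", "paved", "paved"))
)

toy_rec <- recipe(price ~ ., data = toy) %>%
  # dplyr-style helper: every column whose name starts with "lot_"
  step_log(starts_with("lot_")) %>%
  # every categorical variable, excluding the outcome
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

baked <- bake(toy_rec, new_data = NULL)
names(baked)
```

The minus sign in -all_outcomes() removes variables from a selection, in the same way as in dplyr::select().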

2.3 step_novel()

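step_novel() adds a catch-all level (called "new" by default) to each selected factor. If the test set or any later data contains a value that was not present in the training data, that value is assigned to this level instead of causing a problem downstream. Character columns are converted to factors by this step. Missing values will remain missing.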

rec %>%
  step_novel(all_nominal(), -all_outcomes())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         73
## 
## Operations:
## 
## Novel factor level assignment for all_nominal(), -all_outcomes()
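
The effect becomes visible once the recipe is prepped and applied to data with an unseen value. A minimal sketch on hypothetical toy data, where "dirt" occurs only in the test set:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy_train <- data.frame(
  y      = c(1, 2, 3, 4),
  street = c("paved", "gravel", "paved", "gravel"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  y      = c(5, 6),
  street = c("paved", "dirt"),  # "dirt" was never seen in training
  stringsAsFactors = FALSE
)

novel_rec <- recipe(y ~ street, data = toy_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  prep()

baked_test <- bake(novel_rec, new_data = toy_test)

levels(baked_test$street)        # includes the extra level "new"
as.character(baked_test$street)  # the unseen value "dirt" is mapped to "new"
```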

2.4 step_dummy()

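step_dummy() converts nominal data into dummy variables. By default, a factor with k levels becomes k - 1 binary (0 and 1) columns, with the first level serving as the reference.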

rec %>%
 step_dummy(all_nominal())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         73
## 
## Operations:
## 
## Dummy variables from all_nominal()
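
A short sketch on hypothetical toy data shows the resulting columns. The factor zone has three levels, so we expect two dummy columns with the default settings:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  y    = c(1, 2, 3, 4, 5, 6),
  zone = factor(c("A", "B", "C", "A", "B", "C"))
)

dummy_rec <- recipe(y ~ zone, data = toy) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

baked_dummy <- bake(dummy_rec, new_data = NULL)

# zone is replaced by zone_B and zone_C; level "A" is the reference
names(baked_dummy)
```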

2.5 step_zv()

step_zv() removes zero variance variables (variables that contain only a single value).

rec %>%
  step_zv(all_predictors())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         73
## 
## Operations:
## 
## Zero variance filter on all_predictors()
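
A minimal sketch on hypothetical toy data, in which the column constant holds a single value and should therefore be dropped:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  y        = c(1, 2, 3, 4),
  constant = c(7, 7, 7, 7),  # only one distinct value
  varying  = c(1, 3, 2, 5)
)

zv_rec <- recipe(y ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep()

baked_zv <- bake(zv_rec, new_data = NULL)

# "constant" is no longer among the columns
names(baked_zv)
```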

2.6 step_normalize()

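step_normalize() centers and then scales numeric variables so that they have mean = 0 and sd = 1. Note that all_numeric() also selects the numeric outcome Sale_Price; the recipe at the beginning used all_predictors() instead, which leaves the outcome untouched.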

rec %>%
  step_normalize(all_numeric())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         73
## 
## Operations:
## 
## Centering and scaling for all_numeric()
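
The following sketch on hypothetical toy data checks the result of the normalization. Here the outcome is excluded from the selection, which is one possible choice and not necessarily the right one for every model:

```r
library(recipes)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  y = c(100, 200, 300, 400),
  x = c(1, 2, 4, 9)
)

norm_rec <- recipe(y ~ x, data = toy) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep()

baked_norm <- bake(norm_rec, new_data = NULL)

# The predictor now has mean 0 and standard deviation 1 (up to rounding)
c(mean = mean(baked_norm$x), sd = sd(baked_norm$x))
```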

3 Workflows

To combine all of the steps discussed above, we can use the workflows package. A workflow is an object that bundles together your pre-processing, modeling, and post-processing requests.

new_wf <-
 workflow() %>%
 add_recipe(ames_rec) %>%
 add_model(lm_spec)

new_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## ● step_novel()
## ● step_dummy()
## ● step_zv()
## ● step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
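
A workflow only becomes useful once it is fitted. The sketch below shows one way this might look, using hypothetical toy data and object names (toy_train, toy_test, toy_rec, toy_spec, toy_wf). One caveat, stated here as an assumption about current package behavior: workflows is designed to receive an untrained recipe and to run prep() itself when the workflow is fitted, and recent releases may reject a recipe that has already been prepped. For that reason the recipe in this sketch is defined without prep():

```r
library(recipes)
library(parsnip)
library(workflows)

# Hypothetical toy data, used only for illustration
toy_train <- data.frame(
  y     = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0),
  x     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("a", "b", "a", "b", "a", "b"))
)
toy_test <- data.frame(
  x     = c(7, 8),
  group = factor(c("a", "b"), levels = c("a", "b"))
)

# Recipe without prep(): the workflow estimates the steps when it is fitted
toy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_predictors())

toy_spec <- linear_reg() %>%
  set_engine("lm")

toy_wf <- workflow() %>%
  add_recipe(toy_rec) %>%
  add_model(toy_spec)

# fit() preps the recipe on toy_train and then fits the model
toy_wf_fit <- fit(toy_wf, data = toy_train)

# predict() applies the trained recipe to the new data before predicting
toy_preds <- predict(toy_wf_fit, new_data = toy_test)
toy_preds
```

With a fitted workflow there is no need to call juice() or bake() by hand, because the preprocessing is applied automatically at fit and prediction time.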